Project Description¶
We are acting as data scientists working with a car insurance company. They want our help to predict whether a customer will make a claim on their car insurance during the policy period.
Car insurance is a huge business, and insurers spend a lot of time and money trying to predict who is more likely to file a claim. Better predictions help them set fair prices and manage risk.
In this project, the company wants to start simple. They don’t have advanced tools yet, so they asked us to:
- Find the one feature (column) in the data that gives the most accurate predictions.
- Use that feature to build a logistic regression model (a type of machine learning model).
- Measure performance using accuracy (defined just below).
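Accuracy is simply the fraction of predictions the model gets right. In terms of the confusion matrix used later in this notebook:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$/$TN$ are true positives/negatives and $FP$/$FN$ are false positives/negatives.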
Goal¶
Our job is to:
- Build several models, each using only one feature at a time.
- Compare how well they predict whether someone will make a claim or not.
- Tell the company which single feature works best.
The Dataset¶
The file is called car_insurance.csv and contains information about different customers. The last column, outcome, shows whether the customer made a claim (1) or not (0).
Dataset Columns¶
| Column | Description |
|---|---|
| id | Unique ID for each customer |
| age | Customer age group: 0: 16–25, 1: 26–39, 2: 40–64, 3: 65+ |
| gender | 0: Female, 1: Male |
| driving_experience | Years of driving: 0: 0–9, 1: 10–19, 2: 20–29, 3: 30+ |
| education | 0: No education, 1: High school, 2: University |
| income | 0: Poverty, 1: Working class, 2: Middle class, 3: Upper class |
| credit_score | Score between 0 and 1 (higher is better) |
| vehicle_ownership | 0: Doesn’t own car, 1: Owns car |
| vehicle_year | 0: Before 2015, 1: 2015 or later |
| married | 0: Not married, 1: Married |
| children | Number of children |
| postal_code | Area code (not useful for prediction) |
| annual_mileage | Miles driven per year |
| vehicle_type | 0: Sedan, 1: Sports car |
| speeding_violations | Total number of speeding tickets |
| duis | Number of DUI offenses (drunk driving) |
| past_accidents | Number of past car accidents |
| outcome | Target column: 1: Made a claim, 0: Did not make a claim |
Final Objective¶
Test every feature in the dataset one at a time to find out:
- Which single feature gives the highest prediction accuracy for the outcome column?

This will help the company start simple with their machine learning strategy.
In [2]:
# Import required modules
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit
In [3]:
# Read in dataset
cars = pd.read_csv("car_insurance.csv")
# Check for missing values
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   id                   10000 non-null  int64
 1   age                  10000 non-null  int64
 2   gender               10000 non-null  int64
 3   driving_experience   10000 non-null  object
 4   education            10000 non-null  object
 5   income               10000 non-null  object
 6   credit_score         9018 non-null   float64
 7   vehicle_ownership    10000 non-null  float64
 8   vehicle_year         10000 non-null  object
 9   married              10000 non-null  float64
 10  children             10000 non-null  float64
 11  postal_code          10000 non-null  int64
 12  annual_mileage       9043 non-null   float64
 13  vehicle_type         10000 non-null  object
 14  speeding_violations  10000 non-null  int64
 15  duis                 10000 non-null  int64
 16  past_accidents       10000 non-null  int64
 17  outcome              10000 non-null  float64
dtypes: float64(6), int64(7), object(5)
memory usage: 1.4+ MB
In [4]:
# Fill missing values with the column mean
cars["credit_score"] = cars["credit_score"].fillna(cars["credit_score"].mean())
cars["annual_mileage"] = cars["annual_mileage"].fillna(cars["annual_mileage"].mean())
In [7]:
# Empty list to store model results
models = []
# Empty list to store accuracies
accuracies = []
# Feature columns
features = cars.drop(columns=["id", "outcome"]).columns
In [8]:
# Loop through features
for col in features:
    # Create a single-feature model
    model = logit(f"outcome ~ {col}", data=cars).fit()
    # Add each fitted model to the models list
    models.append(model)
Optimization terminated successfully.
         Current function value: 0.511794
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.615951
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.467092
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.603742
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.531499
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.572557
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.552412
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.572668
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.586659
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.595431
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.617345
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.605716
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.621700
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.558922
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.598699
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.549220
         Iterations 7
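The convergence messages above are harmless. If you prefer a quieter notebook, the fit() method in statsmodels accepts a disp argument that silences them; a minimal alternative to the loop above (run one or the other, not both):
In [ ]:
# Same fitting loop, with convergence messages suppressed
for col in features:
    models.append(logit(f"outcome ~ {col}", data=cars).fit(disp=0))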
In [9]:
# Loop through fitted models
for model in models:
    # Confusion matrix (rows: actual outcome, columns: predicted outcome)
    conf_matrix = model.pred_table()
    # True negatives
    tn = conf_matrix[0, 0]
    # True positives
    tp = conf_matrix[1, 1]
    # False negatives
    fn = conf_matrix[1, 0]
    # False positives
    fp = conf_matrix[0, 1]
    # Accuracy: correct predictions over all predictions
    acc = (tn + tp) / (tn + fn + fp + tp)
    accuracies.append(acc)
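Because accuracy is just the correct cells of the confusion matrix divided by the total, each iteration’s computation can also be written as a one-liner using the numpy import from the top of the notebook (an equivalent sketch, same result):
In [ ]:
# tn + tp is the matrix trace; conf_matrix.sum() is the total observation count
acc = np.trace(conf_matrix) / conf_matrix.sum()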
In [10]:
# Find the feature with the largest accuracy
best_feature = features[accuracies.index(max(accuracies))]
# Create best_feature_df
best_feature_df = pd.DataFrame({"best_feature": best_feature,
"best_accuracy": max(accuracies)},
index=[0])
best_feature_df
Out[10]:
| | best_feature | best_accuracy |
|---|---|---|
| 0 | driving_experience | 0.7771 |
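To see how all sixteen single-feature models stack up, one could also tabulate and sort every accuracy (an optional sketch, not in the original analysis):
In [ ]:
# Rank all single-feature models by accuracy, best first
results = pd.DataFrame({"feature": features, "accuracy": accuracies})
results.sort_values("accuracy", ascending=False).head()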
**driving_experience** is the best single feature for predicting the outcome variable, with an accuracy of 0.7771.